In this project, I explored online retail transactions to uncover customer behavior, top-selling products, customer segments, high-value buyers, and cross-selling opportunities. The goal is to provide actionable insights for marketing strategies that enhance retention and boost revenue.
The Online Retail dataset from the UCI Machine Learning Repository contains ~542k transactions from a UK-based non-store retailer between Dec 2010 and Dec 2011. The company mainly sells unique all-occasion gifts, and many customers are wholesalers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
First, I loaded the dataset, removed missing or invalid transactions, and created a new feature, TotalPrice.
df = pd.read_excel("Online Retail.xlsx")
# Drop missing CustomerID and invalid transactions
df = df.dropna(subset=["CustomerID"])
df = df[(df["Quantity"]>0) & (df["UnitPrice"]>0)]
# Convert InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
# Feature: TotalPrice
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]
# Get the latest date
snapshot_date = df["InvoiceDate"].max()
print("Cleaned dataset shape:", df.shape)
Cleaned dataset shape: (397884, 9)
The dataset contains ~398k transactions across 9 columns.
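One note on the cleaning step: in this dataset, cancellation invoices have an InvoiceNo prefixed with "C" and carry negative quantities, so the `Quantity > 0` filter drops them implicitly. A minimal sketch on a made-up frame (values are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative frame mirroring the dataset schema (values are made up)
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536520"],
    "CustomerID": [17850.0, 14527.0, None],
    "Quantity": [6, -1, 4],
    "UnitPrice": [2.55, 27.50, 1.25],
})

cleaned = df.dropna(subset=["CustomerID"])
cleaned = cleaned[(cleaned["Quantity"] > 0) & (cleaned["UnitPrice"] > 0)]

# Cancellations (InvoiceNo prefixed "C") carry negative quantities,
# so the Quantity > 0 filter removes them as well
assert not cleaned["InvoiceNo"].str.startswith("C").any()
print(cleaned.shape)  # (1, 4)
```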
df.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalPrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
To understand revenue drivers, I aggregated sales by product.
top_products = df.groupby("Description")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(x=top_products.values[::-1], y=top_products.index[::-1],
       orientation='h', title="Top 10 Products by Revenue").show()
Products like "Paper Craft, Little Birdie" and "Regency Cakestand 3 Tier" are key revenue contributors. Marketing campaigns could highlight these best-sellers to boost sales further.
Next, I explored which markets drive revenue.
top_countries = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(x=top_countries.values[::-1], y=top_countries.index[::-1],
       orientation='h', title="Top 10 Countries by Revenue").show()
The UK clearly dominates sales. Marketing and bundling strategies should prioritize UK customers for maximum impact.
Looking at the last six months helps identify seasonality or stock issues.
# Last 6 months
six_months_ago = snapshot_date - pd.DateOffset(months=6)
df_recent = df[df["InvoiceDate"] >= six_months_ago].copy()
df_recent["InvoiceMonth"] = df_recent["InvoiceDate"].dt.to_period("M").astype(str)
# Aggregate monthly sales
monthly_sales = df_recent.groupby("InvoiceMonth")["TotalPrice"].sum().reset_index()
monthly_sales = monthly_sales.sort_values("InvoiceMonth")
# Visualize trend
px.line(monthly_sales, x="InvoiceMonth", y="TotalPrice",
title="Monthly Sales Trend (Last 6 Months)", markers=True).show()
There’s a dip in November revenue, likely due to seasonality or stock issues. This insight can guide inventory planning and seasonal campaigns.
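To quantify the dip rather than eyeball the chart, month-over-month percentage change works well; a sketch on made-up monthly totals (the real figures differ):

```python
import pandas as pd

# Illustrative monthly totals (figures are made up, not the dataset's)
monthly_sales = pd.DataFrame({
    "InvoiceMonth": ["2011-09", "2011-10", "2011-11", "2011-12"],
    "TotalPrice": [1_000_000.0, 1_100_000.0, 900_000.0, 1_050_000.0],
})

# Month-over-month change makes dips explicit instead of reading them off a line chart
monthly_sales["MoM_pct"] = (monthly_sales["TotalPrice"].pct_change() * 100).round(1)
print(monthly_sales)  # November shows -18.2 for these toy numbers
```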
To tailor marketing, I used Recency, Frequency, and Monetary (RFM) metrics to segment customers.
# RFM calculation
rfm = df.groupby("CustomerID").agg({
"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "nunique",
"TotalPrice": "sum"
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","TotalPrice":"Monetary"})
# Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[["Recency","Frequency","Monetary"]])
from sklearn.cluster import KMeans
# Elbow method
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(rfm_scaled).inertia_ for k in range(1,8)]
plt.plot(range(1,8), inertia, 'o-')
plt.xlabel("k"); plt.ylabel("Inertia"); plt.title("Elbow Method")
plt.show()
The elbow analysis suggests 3 clusters as a reasonable choice.
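To sanity-check the elbow choice, silhouette scores can be computed for each k (higher means tighter, better-separated clusters). A sketch on synthetic, well-separated data standing in for rfm_scaled (the real scores will differ):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix; real data will differ
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Silhouette needs at least 2 clusters, so start the sweep at k=2
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
    for k in range(2, 8)
}
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this clean synthetic data
```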
# Train K-Means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["Segment"] = kmeans.fit_predict(rfm_scaled)
# Segment Profile
rfm_profile = rfm.groupby("Segment").agg({"Recency":"mean","Frequency":"mean","Monetary":["mean","count"]}).round(2)
print(rfm_profile)
| Segment | Recency (mean) | Frequency (mean) | Monetary (mean) | Customers |
|---|---|---|---|---|
| 0 | 39.98 | 4.85 | 2012.11 | 3231 |
| 1 | 245.02 | 1.58 | 631.14 | 1093 |
| 2 | 6.14 | 80.21 | 122888.41 | 14 |
# 3D visualization
fig_rfm3d = px.scatter_3d(rfm, x='Recency', y='Frequency', z='Monetary',
color='Segment', size='Monetary', opacity=0.7,
title="3D RFM Segments")
fig_rfm3d.show()
Customer Segments:
- Segment 0 (mid-tier, 3,231 customers): purchased recently (~40 days ago) with moderate frequency and spend; good candidates for upselling.
- Segment 1 (at-risk, 1,093 customers): inactive for roughly eight months with low frequency and spend; targets for re-engagement campaigns.
- Segment 2 (VIPs, 14 customers): extremely frequent, very high-spending buyers, likely wholesalers; top priority for retention.
I defined high-value customers as the top 25% in total spending and trained a Random Forest model to predict them.
# Define target
customer_spend = df.groupby("CustomerID")["TotalPrice"].sum()
q3 = customer_spend.quantile(0.75)
high_value_customers = customer_spend > q3
# Features
customer_features = df.groupby("CustomerID").agg({
"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "nunique",
"UnitPrice": "mean",
"Quantity": "mean",
"Country": lambda x: x.mode()[0]
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","UnitPrice":"AvgUnitPrice","Quantity":"AvgQuantity","Country":"Country"})
customer_features = pd.get_dummies(customer_features, columns=["Country"], drop_first=True)
X = customer_features
y = high_value_customers.astype(int)
# Scaling
X_scaled = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
random_state=42, stratify=y)
from sklearn.ensemble import RandomForestClassifier
# Train Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)
# Predict probabilities and classify high-value customers
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.4).astype(int) # threshold 0.4 to capture more high-value customers
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
print("ROC-AUC:", roc_auc_score(y_test, y_proba).round(3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
ROC-AUC: 0.943
Confusion Matrix:
[[614 37]
[ 55 162]]
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.92 | 0.94 | 0.93 | 651 |
| 1 | 0.81 | 0.75 | 0.78 | 217 |
| accuracy | | | 0.89 | 868 |
| macro avg | 0.87 | 0.84 | 0.85 | 868 |
| weighted avg | 0.89 | 0.89 | 0.89 | 868 |
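The 0.4 threshold trades precision for recall. To see how, one can sweep the threshold and watch both metrics move; a sketch on made-up probabilities standing in for y_proba (the model's real numbers differ):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)
# Synthetic stand-ins for y_test / y_proba (made up; the model's outputs differ)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.15, size=500), 0, 1)

# Lowering the threshold from the default 0.5 captures more true
# high-value customers (recall) at the cost of some precision
for t in (0.3, 0.4, 0.5, 0.6):
    pred = (y_prob >= t).astype(int)
    print(f"t={t}: precision={precision_score(y_true, pred):.2f} "
          f"recall={recall_score(y_true, pred):.2f}")
```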
With ROC-AUC of 0.943, the model effectively identifies top spenders for focused marketing campaigns.
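A natural follow-up is to ask which features drive the predictions; Random Forests expose this via feature_importances_. A minimal sketch on synthetic data reusing the feature names built above (the real ranking may differ):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the customer feature matrix; column names mirror
# the features built above, but the data (and the ranking) is made up
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X = pd.DataFrame(X, columns=["Recency", "Frequency", "AvgUnitPrice", "AvgQuantity"])

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = (pd.Series(clf.feature_importances_, index=X.columns)
               .sort_values(ascending=False))
print(importances.round(3))
```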
Using FP-Growth on UK transactions, I identified products frequently bought together.
from mlxtend.frequent_patterns import fpgrowth, association_rules
# Pivot table & filter UK transactions
basket_sets = (df[df['Country']=="United Kingdom"]
.groupby(['InvoiceNo','Description'])['Quantity']
.sum().unstack().loc[:, lambda x: x.sum() > 10] > 0)
# FP-Growth & association rules
frequent_itemsets = fpgrowth(basket_sets, min_support=0.01, use_colnames=True)
rules_top = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2) \
.sort_values('lift', ascending=False).head(10)
rules_top
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 541 | (HERB MARKER THYME) | (HERB MARKER ROSEMARY) | 0.010753 | 0.010873 | 0.010153 | 0.944134 | 86.829038 | 1.0 | 0.010036 | 17.705365 | 0.999228 | 0.884817 | 0.943520 | 0.938918 |
| 540 | (HERB MARKER ROSEMARY) | (HERB MARKER THYME) | 0.010873 | 0.010753 | 0.010153 | 0.933702 | 86.829038 | 1.0 | 0.010036 | 14.921137 | 0.999350 | 0.884817 | 0.932981 | 0.938918 |
| 914 | (REGENCY TEA PLATE GREEN ) | (REGENCY TEA PLATE ROSES ) | 0.013637 | 0.015980 | 0.011534 | 0.845815 | 52.930211 | 1.0 | 0.011316 | 6.382074 | 0.994671 | 0.637874 | 0.843311 | 0.783810 |
| 915 | (REGENCY TEA PLATE ROSES ) | (REGENCY TEA PLATE GREEN ) | 0.015980 | 0.013637 | 0.011534 | 0.721805 | 52.930211 | 1.0 | 0.011316 | 3.545575 | 0.997040 | 0.637874 | 0.717958 | 0.783810 |
| 617 | (POPPY'S PLAYHOUSE BEDROOM ) | (POPPY'S PLAYHOUSE LIVINGROOM ) | 0.015619 | 0.012556 | 0.010153 | 0.650000 | 51.769856 | 1.0 | 0.009956 | 2.821270 | 0.996244 | 0.563333 | 0.645550 | 0.729306 |
| 616 | (POPPY'S PLAYHOUSE LIVINGROOM ) | (POPPY'S PLAYHOUSE BEDROOM ) | 0.012556 | 0.015619 | 0.010153 | 0.808612 | 51.769856 | 1.0 | 0.009956 | 5.143389 | 0.993153 | 0.563333 | 0.805576 | 0.729306 |
| 928 | (SET OF 3 WOODEN STOCKING DECORATION) | (SET OF 3 WOODEN TREE DECORATIONS) | 0.014959 | 0.013757 | 0.010333 | 0.690763 | 50.211536 | 1.0 | 0.010127 | 3.189279 | 0.994968 | 0.562092 | 0.686450 | 0.720927 |
| 929 | (SET OF 3 WOODEN TREE DECORATIONS) | (SET OF 3 WOODEN STOCKING DECORATION) | 0.013757 | 0.014959 | 0.010333 | 0.751092 | 50.211536 | 1.0 | 0.010127 | 3.957447 | 0.993755 | 0.562092 | 0.747312 | 0.720927 |
| 618 | (POPPY'S PLAYHOUSE LIVINGROOM ) | (POPPY'S PLAYHOUSE KITCHEN) | 0.012556 | 0.017301 | 0.010693 | 0.851675 | 49.225611 | 1.0 | 0.010476 | 6.625290 | 0.992142 | 0.557994 | 0.849063 | 0.734865 |
| 619 | (POPPY'S PLAYHOUSE KITCHEN) | (POPPY'S PLAYHOUSE LIVINGROOM ) | 0.017301 | 0.012556 | 0.010693 | 0.618056 | 49.225611 | 1.0 | 0.010476 | 2.585309 | 0.996934 | 0.557994 | 0.613199 | 0.734865 |
For example, Herb Marker Rosemary is often purchased with Herb Marker Thyme, suggesting bundling opportunities.
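To operationalize the rules, a small lookup can map a product to its bundle partners ranked by lift. A sketch using two rules copied from the table above (in practice you would pass rules_top directly):

```python
import pandas as pd

# Toy frame with the columns association_rules emits (two rules from the table)
rules = pd.DataFrame({
    "antecedents": [frozenset({"HERB MARKER THYME"}),
                    frozenset({"REGENCY TEA PLATE GREEN"})],
    "consequents": [frozenset({"HERB MARKER ROSEMARY"}),
                    frozenset({"REGENCY TEA PLATE ROSES"})],
    "lift": [86.83, 52.93],
})

def recommend(product, rules, top_n=3):
    """Return bundle partners for a product, ranked by lift."""
    hits = rules[rules["antecedents"].apply(lambda a: product in a)]
    return [item for cons in hits.nlargest(top_n, "lift")["consequents"] for item in cons]

print(recommend("HERB MARKER THYME", rules))  # ['HERB MARKER ROSEMARY']
```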
From ~398k transactions, we identified top-selling products and confirmed the UK as the main market. Seasonal sales dips suggest opportunities for smarter inventory planning. Customers were segmented into VIPs, mid-tier, and at-risk groups, enabling targeted retention, upselling, and re-engagement. High-value buyers were predicted with a Random Forest model, while FP-Growth revealed frequent product bundles. These insights empower marketing teams to run personalized campaigns, make data-driven decisions, and drive sustainable growth.